Time is everywhere. Whether we wish to predict trends in financial markets, electricity consumption, or stock prices, time is a factor that must be considered in our models.
So what is a time series, then?
A time series is simply a series of data points ordered in time. Time series analysis is a statistical technique that deals with such data, for example to identify trends.
The data comes in three types:
Time series Data: A set of observations on the values that a variable takes at different times.
Cross-sectional data: Data of one or more variables, collected at the same point in time.
Pooled Data: A combination of time series data and cross-sectional data.
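As a rough illustration (with made-up GDP numbers), the three types might look like this in R:

```r
# Time series data: one variable observed at successive points in time
gdp_ts <- ts(c(2.1, 2.3, 2.2, 2.5), start = 2018, frequency = 1)

# Cross-sectional data: several units observed at a single point in time
cross_section <- data.frame(country  = c("A", "B", "C"),
                            gdp_2020 = c(2.5, 1.8, 3.1))

# Pooled data: the same units observed over several time points
pooled <- data.frame(country = rep(c("A", "B"), each = 2),
                     year    = rep(2019:2020, times = 2),
                     gdp     = c(2.3, 2.5, 1.7, 1.8))
```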
Several other questions come into play when dealing with time series:
Is the series stationary?
Is there seasonality?
Is the target variable autocorrelated?
We will go through them one by one.
Informally, autocorrelation is the similarity between observations as a function of the time lag between them.
Briefly, autocorrelation is a characteristic of data that shows the degree of similarity between the values of the same variable over successive time intervals.
Above is an example of an autocorrelation plot. Looking closely, you will notice that the first and 24th values are highly correlated, as are the 12th and 36th observations. This means that we find a very similar value every 24 units of time.
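Before using R's built-in tools, here is a minimal sketch (on simulated data) of what autocorrelation computes: the correlation of a series with a copy of itself shifted by k steps.

```r
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 200))  # series with known autocorrelation

# Lag-k autocorrelation: correlate the series with itself shifted by k steps,
# using the overall mean and variance (the same convention acf() uses)
acf_lag <- function(x, k) {
  x_c <- x - mean(x)
  n <- length(x)
  sum(x_c[(k + 1):n] * x_c[1:(n - k)]) / sum(x_c^2)
}

acf_lag(x, 1)                                          # roughly 0.7 for this AR(1)
all.equal(acf_lag(x, 1), acf(x, plot = FALSE)$acf[2])  # TRUE: matches acf()
```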
Now let’s see an autocorrelation test in practice:
#loading data
stock_data <- read.csv("stock_data.csv")
aapl_prices <- stock_data$AAPL
aapl_prices_ts <- ts(aapl_prices) #converting it to a time series
plot.ts(aapl_prices_ts,main="Stock Prices",ylab="Prices")
Finding autocorrelation using the acf() function: acf() computes estimates of the autocovariance or autocorrelation function for different time lags.
Below we get the autocorrelations for lags 1 to 10. Notice that the correlation keeps decreasing as the lag increases.
acf(aapl_prices_ts, lag.max = 10, plot = FALSE)
##
## Autocorrelations of series 'aapl_prices_ts', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.987 0.975 0.963 0.952 0.940 0.928 0.916 0.904 0.892 0.880
We can also get the same information from an ACF plot, as shown below. This is called a correlogram, also known as an autocorrelation plot.
acf(aapl_prices_ts, lag.max = 10)
Seasonality refers to periodic fluctuations. For example, electricity consumption is high during the day and low at night, and online sales increase before Christmas and then slow down again. In such data the seasonality is clear: consumption peaks towards the evening, and the lowest points come at the beginning and end of each day.
Now we will look at a practical example of a seasonal series in R:
set.seed(5)
x1 = 1:96/20 + ts(rnorm(96, 100, 1), start=c(2015,1), frequency=12)
x2 = 1:96/20 + ts(rnorm(96, 100, 1) + ts(sin((2*pi*rep(1:12, 8))/12), frequency=12), start=c(2015,1), frequency=12)
ts.plot(x1, x2, col = c("darkblue", "darkred"), main = "Simple Seasonal Series")
legend("topleft", c("Non seasonal series", "Seasonal series"), col=c("darkblue", "darkred"), lty=1)
Now we can test whether they are seasonal or not, using the WO (Webel-Ollech) test from the seastests package:
library(seastests) #provides wo() and isSeasonal()
print("Testing the non-seasonal series")
## [1] "Testing the non-seasonal series"
summary(wo(x1))
## Test used: WO
##
## Test statistic: 0
## P-value: 1 1 0.179945
##
## The WO - test does not identify seasonality
print("")
## [1] ""
print("Testing the seasonal series")
## [1] "Testing the seasonal series"
summary(wo(x2))
## Test used: WO
##
## Test statistic: 1
## P-value: 1 0.005145348 0.001151393
##
## The WO - test identifies seasonality
If we are only interested in knowing whether a series is seasonal or not, e.g. for further use in our analysis, the function isSeasonal() can be called.
print("Test using the non-seasonal series")
## [1] "Test using the non-seasonal series"
isSeasonal(x1)
## [1] FALSE
print("Test using the seasonal series")
## [1] "Test using the seasonal series"
isSeasonal(x2)
## [1] TRUE
This can then, for example, be used in the auto.arima() function from the forecast package. For the non-seasonal series:
library(forecast) #provides auto.arima() and forecast()
mod1 <- auto.arima(x1, seasonal = isSeasonal(x1))
summary(mod1)
## Series: x1
## ARIMA(3,1,2)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## 0.9307 0.0547 -0.3898 -1.7288 0.8656
## s.e. 0.1400 0.1362 0.1129 0.1118 0.1122
##
## sigma^2 estimated as 1.034: log likelihood=-134.65
## AIC=281.29 AICc=282.25 BIC=296.62
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1724469 0.9845776 0.770701 0.160464 0.7525125 0.632579
## ACF1
## Training set -0.03608056
Plotting the forecast:
plot(forecast(mod1, h = 12))
For the seasonal series:
mod2 <- auto.arima(x2, seasonal = isSeasonal(x2))
summary(mod2)
## Series: x2
## ARIMA(0,1,1)(1,0,0)[12]
##
## Coefficients:
## ma1 sar1
## -0.8642 0.3531
## s.e. 0.0502 0.1052
##
## sigma^2 estimated as 1.585: log likelihood=-157.09
## AIC=320.17 AICc=320.43 BIC=327.83
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2134882 1.239136 0.9547015 0.1950671 0.9304067 0.777594
## ACF1
## Training set 0.03000515
And the corresponding forecast plot:
plot(forecast(mod2, h = 12))
Stationarity is an important characteristic of time series. A time series is said to be stationary if its statistical properties do not change over time: it has constant mean and variance, and its covariance is independent of time.
Looking at a plot of such a process, we would see that the mean and variance do not vary over time.
Real-world data are often not stationary, but ideally we want a stationary time series for modelling. Since most series are not stationary, we can apply transformations to make them so.
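As a minimal sketch of such a transformation (with simulated data): a random walk is non-stationary, but its first difference is just the original white noise steps and is therefore stationary.

```r
set.seed(42)
steps       <- rnorm(300)         # stationary white noise
random_walk <- cumsum(steps)      # non-stationary: variance grows over time
increments  <- diff(random_walk)  # differencing recovers the stationary steps

all.equal(as.numeric(increments), steps[-1])  # TRUE: the steps are recovered exactly

par(mfrow = c(1, 2))
plot.ts(random_walk, main = "Random walk (non-stationary)")
plot.ts(increments,  main = "First difference (stationary)")
```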
So how do we check whether a series is stationary?
Here we will look at a couple of methods for checking if the time series data is stationary.
Identify if correlation at different time lags goes to 0
First we need data, so we will generate a couple of time series signals: one that we know is stationary (Gaussian noise) and one with a trend (the cumulative sum of Gaussian noise).
t = 0:300
y_stationary <- rnorm(length(t), mean = 1, sd = 1) #the stationary time series
y_trend <- cumsum(rnorm(length(t), mean = 1, sd = 4))+t/100 #time series with trend
#normalize the data
y_stationary <- y_stationary/max(y_stationary)
y_trend <- y_trend/max(y_trend)
We can check each for characteristics of stationarity by looking at its autocorrelation function (ACF). For a stationary signal, because we expect no dependence on time, we would expect the ACF to go to 0 for every time lag (\(\tau\))
par(mfcol=c(2,2)) #2x2 grid, filled column-wise: each signal above its ACF
# the stationary signal and ACF
plot(t,y_stationary,
type='l',col='red',
xlab = "time (t)",
ylab = "Y(t)",
main = "Stationary signal")
acf(y_stationary,lag.max = length(y_stationary),
xlab = "lag #", ylab = 'ACF',main=' ')
# the trend signal and ACF
plot(t,y_trend,
type='l',col='red',
xlab = "time (t)",
ylab = "Y(t)",
main = "Trend signal")
acf(y_trend,lag.max = length(y_trend),
xlab = "lag #", ylab = 'ACF', main=' ')
Notably, the stationary signal (top left) results in few significant lags exceeding the confidence interval of the ACF (blue dashed lines), while for the series with a trend almost all lags exceed it. From the ACFs we can therefore conclude that the first signal is stationary.
Another test we can conduct is the Augmented Dickey-Fuller (ADF) test, which checks whether the series has a unit root (a series with a trend will have a unit root and yield a large p-value). It is available as adf.test() in the tseries package:
library(tseries) #provides adf.test() and kpss.test()
options(warn=-1)
adf.test(y_stationary)
##
## Augmented Dickey-Fuller Test
##
## data: y_stationary
## Dickey-Fuller = -5.7832, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(y_trend)
##
## Augmented Dickey-Fuller Test
##
## data: y_trend
## Dickey-Fuller = -1.9298, Lag order = 6, p-value = 0.6059
## alternative hypothesis: stationary
Lastly, we can test whether the time series is level or trend stationary using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. Here the null hypothesis is trend stationarity, so a low p-value indicates a signal that is not trend stationary (i.e. has a unit root):
kpss.test(y_stationary, null="Trend")
##
## KPSS Test for Trend Stationarity
##
## data: y_stationary
## KPSS Trend = 0.055525, Truncation lag parameter = 5, p-value = 0.1
kpss.test(y_trend, null="Trend")
##
## KPSS Test for Trend Stationarity
##
## data: y_trend
## KPSS Trend = 0.34143, Truncation lag parameter = 5, p-value = 0.01
Now that we have covered everything we need to know before modelling a time series, we can move on to the modelling section.
There are many time series models, among which a set of classical forecasting methods is commonly distinguished. Here we will show the implementation of some of them:
The moving average model is probably the most naive approach to time series modelling. In its simplest form it predicts the next observation as the mean of past observations; the MA(q) model used below instead expresses the series in terms of past forecast errors.
The method is suitable for univariate time series without trend and seasonal components.
A moving average term in a time series model is a past error (multiplied by a coefficient). Let \(w_t \sim N(0, \sigma^2_w)\), meaning that the \(w_t\) are independently and identically distributed, each with a normal distribution having mean 0 and the same variance.
The \(1^{st}\) order moving average model denoted by MA(1) is:
\[x_t = \mu + w_t + \theta_1w_{t-1} \] The \(2^{nd}\) order moving average model, denoted by MA(2) is:
\[x_t = \mu + w_t + \theta_1w_{t-1} + \theta_2w_{t-2} \] The \(q^{th}\) order moving average model, denoted by MA(q), is:
\[x_t = \mu + w_t + \theta_1w_{t-1} + \theta_2w_{t-2} + .....+ \theta_qw_{t-q}\]
Theoretical Properties of a Time Series with an MA(1) Model:
Mean: \(E(x_t) = \mu\)
Variance: \(Var(x_t) = \sigma^2_w(1+\theta^2_1)\)
Autocorrelation function (ACF): \[\rho_1 = \frac{\theta_1}{1+\theta^2_1}, \quad \rho_h = 0 \ \text{for} \ h \ge 2\]
Note that the only nonzero value in the theoretical ACF is at lag 1; all other autocorrelations are 0. Thus a sample ACF with a significant autocorrelation only at lag 1 is an indicator of a possible MA(1) model.
Theoretical Properties of a Time Series with an MA(2) Model:
Mean: \(E(x_t) = \mu\)
Variance: \(Var(x_t) = \sigma^2_w(1+\theta^2_1 + \theta^2_2)\)
Autocorrelation function (ACF): \[\rho_1 = \frac{\theta_1 + \theta_1\theta_2}{1+\theta^2_1+\theta^2_2}, \quad \rho_2 = \frac{\theta_2}{1+\theta^2_1+\theta^2_2}, \quad \rho_h = 0 \ \text{for} \ h \ge 3\]
Note that the only nonzero values in the theoretical ACF are at lags 1 and 2; all higher autocorrelations are 0. So a sample ACF with significant autocorrelations at lags 1 and 2, but non-significant autocorrelations at higher lags, indicates a possible MA(2) model.
ACF for General MA(q) Models: a property of MA(q) models in general is that the autocorrelations are nonzero for the first q lags and 0 for all lags > q.
Non-uniqueness of connection between values of \(\theta_1\) and \(\rho_1\) in MA(1) Model.
In the MA(1) model, any value of \(\theta_1\) and its reciprocal \(1/\theta_1\) give the same value of \[\rho_1 = \frac{\theta_1}{1+\theta^2_1}\] To satisfy a theoretical restriction called invertibility, we restrict MA(1) models to values of \(\theta_1\) with absolute value less than 1.
### Practical with Moving Average
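Before the simulations, the non-uniqueness just described can be checked numerically (a quick sketch): \(\theta_1 = 0.5\) and its reciprocal 2 imply the same \(\rho_1\).

```r
# Theoretical lag-1 autocorrelation of an MA(1) process
rho1 <- function(theta) theta / (1 + theta^2)

rho1(0.5)  # 0.4
rho1(2)    # 0.4 -- same ACF value, which is why we keep the invertible |theta| < 1 root
```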
Implementation with MA(1)
#simulate MA(1) with theta 0.8
ma1 <- arima.sim(list(ma = c(0.8)), n=500)
ts.plot(ma1, main = "MA(1) with theta 0.8")
acf(ma1, main = "ACF MA(1) with theta 0.8")
From the ACF plot we can determine the order of the MA process: in this case the graph cuts off after lag 1.
#Simulating MA(1) using theta -0.8
ma1 <- arima.sim(list(ma = c(-0.8)), n=500)
ts.plot(ma1, main = "MA(1) with theta -0.8")
acf(ma1, main = "ACF MA(1) with theta -0.8")
This time the simulated series has \(\theta_1 = -0.8\), so the lag-1 autocorrelation plotted is on the negative side.
Implementation with MA(2)
#simulate MA(2) with theta1 0.2 and theta2 0.7
ma2 <- arima.sim(list(ma = c(0.2,0.7)), n=500)
ts.plot(ma2, main = "Simulated MA(2) with theta1 = 0.2 theta2 = 0.7")
acf(ma2, main = "ACF MA(2) with theta1 = 0.2 theta2 = 0.7")
Analysis using an MA(q) model: we will use the stock data we used earlier during autocorrelation testing.
stock_data <- read.csv("stock_data.csv")
aapl_prices <- stock_data$AAPL
aapl_prices_ts <- ts(aapl_prices) #converting it to a time series
plot.ts(aapl_prices_ts,main="Stock Prices",ylab="Prices")
acf(aapl_prices_ts, main = "Acf for stock prices")
Since the ACF plot does not cut off, the series is not stationary. Hence, we have to difference the data.
#determine the number of differences
ndiffs(aapl_prices_ts)
## [1] 1
#differencing the series (applied twice here; the fitted model below also uses d = 2)
diff(diff(aapl_prices_ts))
## Time Series:
## Start = 3
## End = 251
## Frequency = 1
## [1] -0.740 0.390 -0.910 -1.190 1.440 -0.130 0.330 -0.440 0.140 -0.990
## [11] 2.470 -1.930 1.060 -1.010 0.050 -0.620 -0.240 -0.550 0.650 6.950
## [21] -4.890 -1.520 1.970 -3.410 2.880 -1.230 1.530 -0.720 -0.450 -1.250
## [31] 0.740 0.320 1.050 -1.400 -0.060 0.020 0.420 -1.130 1.190 -1.160
## [41] 0.360 -0.170 0.510 -0.700 0.920 0.530 0.370 -1.030 0.690 -3.500
## [51] 0.450 4.700 0.200 1.310 -0.020 -4.450 -0.690 1.330 -0.010 1.090
## [61] -2.980 2.080 0.040 0.650 -2.630 2.640 -1.400 1.010 -0.430 0.790
## [71] -0.670 1.820 -1.740 0.790 -1.400 1.010 -0.730 0.000 -0.270 0.290
## [81] -0.400 1.510 -0.450 -3.260 1.550 0.350 0.580 -1.870 2.150 -1.860
## [91] 0.770 2.560 -0.920 -0.830 -2.910 3.730 -3.360 4.120 1.480 -2.920
## [101] 0.150 1.560 -1.600 -0.640 1.130 -0.780 0.110 -0.830 -0.090 1.440
## [111] -1.200 1.630 0.240 0.010 0.740 -2.480 2.540 -1.890 0.630 -0.480
## [121] 0.520 -0.360 -0.200 -0.880 1.000 0.510 -1.240 0.470 -0.880 1.240
## [131] -0.460 0.720 0.710 -0.220 -0.960 0.520 -1.140 0.290 1.170 -0.970
## [141] -0.200 0.430 -0.140 -0.190 2.020 -1.850 -0.050 -0.330 0.040 7.680
## [151] -7.620 0.770 0.660 0.030 -0.730 -0.130 -0.680 1.470 0.560 -1.240
## [161] -0.655 0.540 0.605 -0.570 -0.990 0.710 0.140 -0.210 2.740 -3.630
## [171] 1.650 -1.260 0.620 -0.700 0.200 0.780 -0.400 -0.270 1.680 -1.240
## [181] -0.930 2.170 -3.090 3.200 -2.080 0.220 0.520 2.680 -2.600 -0.510
## [191] -0.080 0.310 1.030 -1.820 0.390 0.040 0.150 -1.370 1.710 -0.920
## [201] 1.530 -1.410 0.110 2.280 -1.930 1.540 -0.480 -1.740 0.960 -0.250
## [211] 3.070 -2.000 -1.380 -0.080 2.960 1.620 -3.070 -1.710 1.420 1.460
## [221] -2.550 0.170 -4.990 7.510 -1.770 0.410 -1.120 -0.270 0.990 -0.790
## [231] 0.320 -0.970 1.330 1.850 -3.790 2.040 0.400 -1.300 -5.630 2.450
## [241] 4.730 -2.600 0.560 -1.150 6.090 -5.400 2.190 -1.100 0.890
ts.plot(diff(diff(aapl_prices_ts)), ylab = "Difference", main = "1st difference data")
The differenced series now fluctuates around a constant mean.
acf(diff(diff(aapl_prices_ts)), main = "ACF for 1st difference data")
acf(diff(diff(aapl_prices_ts)), plot = F)
##
## Autocorrelations of series 'diff(diff(aapl_prices_ts))', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 -0.483 0.025 -0.108 0.071 0.041 -0.060 0.009 -0.033 0.101 -0.069
## 11 12 13 14 15 16 17 18 19 20 21
## 0.029 -0.059 0.061 0.001 -0.078 0.061 0.015 -0.060 0.065 -0.051 0.098
## 22 23
## -0.106 0.072
As you can see, the ACF plot cuts off after the first couple of lags.
We will therefore proceed with a low-order MA model on the twice-differenced series, fitting an ARIMA(0,2,3):
Ma2 <- arima(aapl_prices_ts, order = c(0,2,3))
#testing the model whether it fits or not
test <- Box.test(Ma2$residuals, lag = 12, type = "Ljung-Box", fitdf = 3)
checkresiduals(Ma2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,3)
## Q* = 5.6075, df = 7, p-value = 0.5863
##
## Model df: 3. Total lags used: 10
qchisq(0.95,7)
## [1] 14.06714
Since the p-value is larger than 0.05 and the Q* statistic is smaller than the chi-square critical value, the model fits adequately.
Now we will forecast:
forecast(Ma2, h = 10)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 252 146.5097 144.7268 148.2927 143.7829 149.2365
## 253 146.7333 144.1526 149.3139 142.7865 150.6800
## 254 146.9506 143.7473 150.1540 142.0515 151.8497
## 255 147.1680 143.4408 150.8951 141.4678 152.8681
## 256 147.3853 143.1961 151.5746 140.9785 153.7922
## 257 147.6027 142.9944 152.2109 140.5550 154.6504
## 258 147.8200 142.8250 152.8151 140.1807 155.4593
## 259 148.0374 142.6807 153.3941 139.8451 156.2297
## 260 148.2548 142.5570 153.9526 139.5407 156.9688
## 261 148.4721 142.4501 154.4941 139.2623 157.6820
autoplot(forecast(Ma2, h=10), ylab= "price")
ARIMA (Autoregressive Integrated Moving Average) is a generalization of the autoregressive moving average (ARMA) model. The autoregressive terms refer to lags of the differenced series, the moving average terms refer to lags of the errors, and I is the number of differences used to make the time series stationary.
White noise series: A time series is white noise if the variables are independent and identically distributed with a mean of zero. This means that all variables have the same variance (\(\sigma^2\)) and each value has a zero correlation with all other values in the series.
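A quick sketch of what white noise looks like in practice (simulated, so exact numbers depend on the seed): its sample ACF should show no significant spikes at nonzero lags, and a Ljung-Box test should not reject independence.

```r
set.seed(123)
wn <- rnorm(500)   # i.i.d. N(0, 1) draws: Gaussian white noise

acf(wn, main = "ACF of white noise")        # spikes should stay inside the band
Box.test(wn, lag = 10, type = "Ljung-Box")  # typically a large p-value
```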
Steps to be followed for ARIMA modeling:
This time we will analyse the built-in AirPassengers dataset using an ARIMA model.
data("AirPassengers")
AP <- AirPassengers
str(AP) #time series data
## Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
head(AP)
## Jan Feb Mar Apr May Jun
## 1949 112 118 132 129 121 135
ts(AP, frequency = 12, start = c(1949,1))
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
attributes(AP)
## $tsp
## [1] 1949.000 1960.917 12.000
##
## $class
## [1] "ts"
plot(AP)
From the graph we can see that this time series is not stationary. We will analyze its properties in the exploratory analysis below:
Autocorrelation analysis to examine serial dependence: Used to estimate which value in the past has a correlation with the current value. Provides the p,d,q estimate for ARIMA models.
Spectral Analysis to examine cyclic behavior: Carried out to describe how variation in a time series may be accounted for by cyclic components. Also referred to as a Frequency Domain analysis. Using this, periodic components in a noisy environment can be separated out.
Trend estimation and decomposition: Used for seasonal adjustment. It seeks to construct, from an observed time series, a number of component series(that could be used to reconstruct the original series) where each of these has a certain characteristic.
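As a brief sketch of the spectral analysis mentioned above, base R's spectrum() estimates the periodogram. Differencing first removes the trend so the seasonal cycle stands out; with monthly data, frequencies are in cycles per year, so the annual cycle shows up near frequency 1 and its harmonics.

```r
# Smoothed periodogram of the differenced AirPassengers series
# (spans applies Daniell-kernel smoothing to the raw periodogram)
sp <- spectrum(diff(AirPassengers), spans = c(3, 3), plot = FALSE)
plot(sp, main = "Smoothed periodogram of diff(AirPassengers)")

# Peaks near frequency 1 (and harmonics 2, 3, ...) reflect the annual cycle
```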
Before performing any EDA on the data we need to understand the components of the time series. We can use the following R code to find them: basically, we will decompose the dataset. But before doing that, note that our data show growing year-to-year fluctuations, so we apply a log transformation to deal with this issue.
# Log transform
AP_log <- log(AP)
plot(AP_log)
Now we will decompose and also see the seasonality in depth.
components.ts = decompose(AP_log)
components.ts$figure
## [1] -0.085815019 -0.114412848 0.018113355 -0.013045611 -0.008966106
## [6] 0.115392997 0.210816435 0.204512399 0.064836351 -0.075271265
## [11] -0.215845612 -0.100315075
plot(components.ts$figure,
type = 'b',
xlab = "Month",
ylab = "Seasonality Index",
col = "blue",
las = 2)
plot(components.ts)
According to the seasonality index, travel volume in June, July, and August is above the monthly average.
So after decomposing the data we get 4 components:
Observed – the actual data plot
Trend – the overall upward or downward movement of the data points
Seasonal – any monthly/yearly pattern of the data points
Random – the unexplainable part of the data
Now we need to remove the non-stationary part for ARIMA.
To achieve stationarity:
difference the data- compute the differences between consecutive observations
log or square root the series data to stabilize non-constant variance
Unit root test – used to decide whether differencing is required to make a trending series stationary. In the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, small p-values suggest differencing is required.
We have already seen the log transformation. Now we will see how to do the unit root test.
library("fUnitRoots")
## Loading required package: timeDate
## Loading required package: timeSeries
## Loading required package: fBasics
urkpssTest(AP, type = c("tau"), lags = c("short"),use.lag = NULL, doplot = TRUE)
##
## Title:
## KPSS Unit Root Test
##
## Test Results:
## NA
##
## Description:
## Sat Sep 26 00:26:44 2020 by user:
APstationary = diff(AP, differences=1)
plot(APstationary)
Now calculating the autocorrelation:
acf(AP, lag.max = 34)
As we can infer from the graph above, the autocorrelation decreases as the lag increases, but only slowly: observations separated by large lags are still correlated, which again indicates that the series is not stationary.
timeseriesseasonallyadjusted <- AP_log - components.ts$seasonal
APstationary <- diff(timeseriesseasonallyadjusted, differences=1)
plot(APstationary)
So here we have dealt with the seasonality.
Once the data is ready and satisfies all the assumptions of modelling, to determine the order of the model to be fitted to the data, we need three variables: p, d, and q which are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.
To examine which p and q values are appropriate, we run the acf() and pacf() functions. The partial autocorrelation pacf() at lag k describes the correlation between all data points that are exactly k steps apart, after accounting for their correlation with the data between those k steps. It helps identify the number of autoregressive (AR) coefficients (the order p) in an ARIMA model.
acf(APstationary, lag.max = 34)
pacf(APstationary, lag.max = 34)
Shape of the acf() and pacf() plots to define values of p and q: looking at the graphs and consulting a reference table of ACF/PACF patterns, we can determine which type of model to select and what the values of p, d, and q will be.
fitARIMA <- arima(AP_log, order=c(1,1,1),seasonal = list(order = c(1,0,0), period = 12),method="ML")
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:timeSeries':
##
## time<-
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
coeftest(fitARIMA)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.435105 0.349954 1.2433 0.213750
## ma1 -0.725533 0.280445 -2.5871 0.009679 **
## sar1 0.920483 0.029345 31.3675 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(fitARIMA)
## 2.5 % 97.5 %
## ar1 -0.2507928 1.1210035
## ma1 -1.2751954 -0.1758702
## sar1 0.8629672 0.9779979
Here we can see that sar1 is the most significant coefficient.
R uses maximum likelihood estimation (MLE) to estimate the ARIMA model: for given values of p, d, and q it finds the parameter estimates that maximize the log-likelihood, i.e. the probability of obtaining the data we observed.
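As a sketch of this, we can compare the maximized log-likelihoods and AIC values of two candidate specifications (refitting on the log series so the block is self-contained; lower AIC is preferred):

```r
AP_log <- log(AirPassengers)

# The specification fitted above, and the "airline" model for comparison
fit_a <- arima(AP_log, order = c(1, 1, 1),
               seasonal = list(order = c(1, 0, 0), period = 12), method = "ML")
fit_b <- arima(AP_log, order = c(0, 1, 1),
               seasonal = list(order = c(0, 1, 1), period = 12), method = "ML")

logLik(fit_a); logLik(fit_b)  # maximized log-likelihoods
AIC(fit_a); AIC(fit_b)        # AIC = 2k - 2*logLik; lower is better
```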
The Ljung-Box test is a test of independence at all lags up to the one specified. Instead of testing randomness at each distinct lag, it tests the “overall” randomness based on a number of lags, and is therefore a portmanteau test. It is applied to the residuals of a fitted ARIMA model, not the original series, and in that setting the hypothesis actually being tested is that the residuals from the ARIMA model have no autocorrelation.
acf(fitARIMA$residuals)
library(FitAR)
## Loading required package: lattice
## Loading required package: leaps
## Loading required package: ltsa
## Loading required package: bestglm
##
## Attaching package: 'FitAR'
## The following object is masked from 'package:forecast':
##
## BoxCox
boxresult <- LjungBoxTest (fitARIMA$residuals,k=2,StartLag=1)
plot(boxresult[,3],main= "Ljung-Box Q Test", ylab= "P-values", xlab= "Lag")
qqnorm(fitARIMA$residuals)
qqline(fitARIMA$residuals)
As all the graphs are in support of the assumption that there is no pattern in the residuals, we can go ahead and calculate the forecast.
Workflow diagram
auto.arima() function: The forecast package provides two functions: ets() and auto.arima() for the automatic selection of exponential and ARIMA models.
The auto.arima() function in R uses a combination of unit root tests, minimization of the AIC and MLE to obtain an ARIMA model.
auto.arima(AP_log, trace=TRUE)
##
## ARIMA(2,1,2)(1,1,1)[12] : Inf
## ARIMA(0,1,0)(0,1,0)[12] : -434.799
## ARIMA(1,1,0)(1,1,0)[12] : -474.6299
## ARIMA(0,1,1)(0,1,1)[12] : -483.2101
## ARIMA(0,1,1)(0,1,0)[12] : -449.8857
## ARIMA(0,1,1)(1,1,1)[12] : -481.5957
## ARIMA(0,1,1)(0,1,2)[12] : -481.6451
## ARIMA(0,1,1)(1,1,0)[12] : -477.2164
## ARIMA(0,1,1)(1,1,2)[12] : Inf
## ARIMA(0,1,0)(0,1,1)[12] : -467.4644
## ARIMA(1,1,1)(0,1,1)[12] : -481.582
## ARIMA(0,1,2)(0,1,1)[12] : -481.2991
## ARIMA(1,1,0)(0,1,1)[12] : -481.3006
## ARIMA(1,1,2)(0,1,1)[12] : -481.5633
##
## Best model: ARIMA(0,1,1)(0,1,1)[12]
## Series: AP_log
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001371: log likelihood=244.7
## AIC=-483.4 AICc=-483.21 BIC=-474.77
Predicting with the fitted ARIMA model using predict():
predict(fitARIMA,n.ahead = 5)
## $pred
## Jan Feb Mar Apr May
## 1961 6.107084 6.052951 6.118845 6.207747 6.229875
##
## $se
## Jan Feb Mar Apr May
## 1961 0.04204779 0.05155775 0.05709252 0.06126065 0.06482674
futurVal <- forecast(fitARIMA,h=12, level=c(99.5))
plot(futurVal)
The forecasts are shown as a blue line, with the prediction interval as a shaded area (here a single 99.5% interval, since we set level = c(99.5)).
This is the overall process by which we can analyze time series data and forecast values from existing series using ARIMA.
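To recap, the whole workflow sketched in this article can be condensed into a few lines (assuming the forecast and tseries packages are installed):

```r
library(forecast)  # auto.arima(), forecast(), checkresiduals(), ndiffs()
library(tseries)   # adf.test(), kpss.test()

y <- log(AirPassengers)      # log transform to stabilize the variance

adf.test(y)                  # unit root test: is the series stationary?
ndiffs(y)                    # how many differences are needed

fit <- auto.arima(y)         # order selection via unit root tests and AICc
checkresiduals(fit)          # Ljung-Box test on the residuals
plot(forecast(fit, h = 12))  # 12-month-ahead forecast with intervals
```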